ArielGlenn has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/398861 )

Change subject: config setting to permit a list of wikis to be dumped in a specific order
......................................................................


config setting to permit a list of wikis to be dumped in a specific order

We don't want this for all wikis, but it can be useful for small lists.

Example: the big wikis that dump via 4 processes at a time.  Some of these
take days longer than others; if we start those first, multiple other wikis
will run to completion on other cores while the first ones chug along.
If the order is by which wiki was dumped longest ago, these big wikis often
wind up starting near the end of the list, and run along by themselves after
everything else is completed.

We can't really just expand the configs of these big slow wikis
so they use a pile more processors at once; that's not ok for the db
servers. (enwiki, wikidatawiki, yes, but not the rest.) So, move them
to the front of the queue.
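
As a rough illustration of the effect (a toy model, not code from this change): treat each wiki as one job occupying one core slot, and hand each job to whichever core frees up first. The durations, the core count and the `makespan` helper are all made up for the example.

```python
import heapq


def makespan(durations, cores):
    """Greedy list scheduling: each job starts on the core that frees up first.

    Returns the time at which the last job finishes.
    """
    free_at = [0] * cores          # time at which each core becomes free
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


small = [1] * 12   # hypothetical small wikis, 1 time unit each
big = [10]         # hypothetical slow wiki, 10 time units

slow_last = makespan(small + big, cores=4)    # slow wiki starts at the end
slow_first = makespan(big + small, cores=4)   # slow wiki starts at the front
print(slow_last, slow_first)                  # prints: 13 10
```

With these made-up numbers, starting the slow wiki last leaves it running alone after everything else is done; starting it first lets the small wikis finish alongside it.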

Change-Id: I494ed57363b1ddfe0e10be0aed25facb7ca8a364
---
M xmldumps-backup/defaults.conf
M xmldumps-backup/doc/README.config
M xmldumps-backup/dumps/WikiDump.py
M xmldumps-backup/dumps/utils.py
M xmldumps-backup/worker.py
5 files changed, 44 insertions(+), 9 deletions(-)

Approvals:
  ArielGlenn: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/xmldumps-backup/defaults.conf b/xmldumps-backup/defaults.conf
index d17cbbb..e95ac47 100644
--- a/xmldumps-backup/defaults.conf
+++ b/xmldumps-backup/defaults.conf
@@ -76,3 +76,6 @@
 orderrevs=0
 minpages=1
 maxrevs=50000
+
+[misc]
+fixeddumporder=0
\ No newline at end of file
diff --git a/xmldumps-backup/doc/README.config b/xmldumps-backup/doc/README.config
index d4d76f4..6ccd705 100644
--- a/xmldumps-backup/doc/README.config
+++ b/xmldumps-backup/doc/README.config
@@ -245,6 +245,15 @@
 The above options do not have to be specified in the config file,
 since default values are provided.
 
+=== Misc (i.e.: [misc])
+fixeddumporder -- set this to a non-zero integer to dump the wikis
+                in the specified db list in the order listed
+                (best used only with the skipdone option)
+                Default value: 0 (wiki dumped longest ago goes first)
+
+The above options do not have to be specified in the config file,
+since default values are provided.
+
 === Per-wiki configuration
 The following settings may be overriden for specific wikis by specifying
 their name (the name of the db in the database) as a section header,
diff --git a/xmldumps-backup/dumps/WikiDump.py b/xmldumps-backup/dumps/WikiDump.py
index 80115ec..8ba3838 100644
--- a/xmldumps-backup/dumps/WikiDump.py
+++ b/xmldumps-backup/dumps/WikiDump.py
@@ -171,9 +171,8 @@
         globals like entries in 'wiki' or 'output' that can
         be overriden by a specific named section
         """
-        self.db_list = MiscUtils.db_list(self.get_opt_in_overrides_or_default(
-            "wiki", "dblist", 0))
-
+        self.db_list_unsorted = MiscUtils.db_list(self.get_opt_in_overrides_or_default(
+            "wiki", "dblist", 0), nosort=True)
         # permit comma-separated list of files so that eg some script
         # can skip all private and/or closed wikis in addition to some
         # other exclusion list
@@ -191,7 +190,9 @@
         self.apijobs = self.get_opt_in_overrides_or_default(
             "wiki", "apijobs", 0)
 
-        self.db_list = list(set(self.db_list) - set(self.skip_db_list))
+        self.db_list_unsorted = [dbname for dbname in self.db_list_unsorted
+                                 if dbname not in self.skip_db_list]
+        self.db_list = sorted(self.db_list_unsorted)
 
         if not self.conf.has_section('output'):
             self.conf.add_section('output')
@@ -206,6 +207,11 @@
         self.fileperms = self.get_opt_in_overrides_or_default("output", "fileperms", 0)
         self.fileperms = int(self.fileperms, 0)
 
+        if not self.conf.has_section('misc'):
+            self.conf.add_section('misc')
+        self.fixed_dump_order = self.get_opt_in_overrides_or_default("misc", "fixeddumporder", 0)
+        self.fixed_dump_order = int(self.fixed_dump_order, 0)
+
     def parse_conffile_globally(self):
 
         if not self.conf.has_section('database'):
diff --git a/xmldumps-backup/dumps/utils.py b/xmldumps-backup/dumps/utils.py
index c782931..9c2f040 100644
--- a/xmldumps-backup/dumps/utils.py
+++ b/xmldumps-backup/dumps/utils.py
@@ -17,7 +17,7 @@
 
 class MiscUtils(object):
     @staticmethod
-    def db_list(path):
+    def db_list(path, nosort=False):
         """Read database list from a file"""
         if not path:
             return []
@@ -28,7 +28,8 @@
             if line != "":
                 dbs.append(line)
         infhandle.close()
-        dbs = sorted(dbs)
+        if not nosort:
+            dbs = sorted(dbs)
         return dbs
 
     @staticmethod
diff --git a/xmldumps-backup/worker.py b/xmldumps-backup/worker.py
index 507de7c..9478452 100644
--- a/xmldumps-backup/worker.py
+++ b/xmldumps-backup/worker.py
@@ -39,6 +39,7 @@
         return True
 
     wiki.set_date(date)
+    wiki.config.parse_conffile_per_project(wiki.db_name)
 
     runner = Runner(wiki, prefetch=prefetch, prefetchdate=prefetchdate, spawn=spawn, job=job,
                     skip_jobs=skipjobs, restart=restart, notice=html_notice, dryrun=dryrun,
@@ -90,11 +91,26 @@
                         date=None, job=None, skipjobs=None, page_id_range=None,
                         partnum_todo=None, checkpoint_file=None, skipdone=False, restart=False,
                         verbose=False):
-    nextdbs = config.db_list_by_age(bystatustime)
-    nextdbs.reverse()
+    # note that fixed_dump_order had better be used only with the skipdone option,
+    # otherwise the first wiki in the list will be run over and over :-P
+    # we like this order because we can put one of the "bigwikis" that takes
+    # forever to finish, at the head of the list, letting it take however many cores
+    # and be slow, while the rest of the wikis run on the other cores one after
+    # another and finish up.  If we start the slowest one lots later, it might
+    # be the only thing running for several days when the rest of the wikis have
+    # already finished, it doesn't expand to use all available cores (this would be
+    # too hard on the db servers)
+    if config.fixed_dump_order:
+        nextdbs = config.db_list_unsorted
+    else:
+        nextdbs = config.db_list_by_age(bystatustime)
+        nextdbs.reverse()
 
     if verbose and not cutoff:
-        sys.stderr.write("Finding oldest unlocked wiki...\n")
+        if config.fixed_dump_order:
+            sys.stderr.write("Finding next unlocked wiki in list...\n")
+        else:
+            sys.stderr.write("Finding oldest unlocked wiki...\n")
 
     # if we skip locked wikis which are missing the prereqs for this job,
     # there are still wikis where this job needs to run
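
A minimal sketch of the ordering logic in this change, pulled out of the real classes. `remove_skipped` and `pick_order` are hypothetical names, the wiki names are sample data, and `by_age` merely stands in for `config.db_list_by_age(bystatustime)` followed by `reverse()`; here it is faked with `sorted`. The point is that a list comprehension keeps the order from the db list file, where `list(set(a) - set(b))` would not.

```python
def remove_skipped(dbs, skip):
    """Drop skipped wikis while keeping the order read from the list file."""
    skip_set = set(skip)
    return [db for db in dbs if db not in skip_set]


def pick_order(unsorted_dbs, by_age, fixed_dump_order):
    """Return wikis in file order if fixed_dump_order is set, else by age.

    by_age is a callable standing in for the real age-based ordering.
    """
    if fixed_dump_order:
        return list(unsorted_dbs)
    return by_age()


# sample db list in the order an operator might write it (slowest first)
listed = ["wikidatawiki", "enwiki", "dewiki", "testwiki"]
unsorted_dbs = remove_skipped(listed, ["testwiki"])


def fake_by_age():
    # placeholder for the real "dumped longest ago goes first" ordering
    return sorted(unsorted_dbs)


fixed = pick_order(unsorted_dbs, fake_by_age, fixed_dump_order=1)
default = pick_order(unsorted_dbs, fake_by_age, fixed_dump_order=0)
print(fixed)    # ['wikidatawiki', 'enwiki', 'dewiki']
print(default)  # ['dewiki', 'enwiki', 'wikidatawiki']
```

As the comment in worker.py warns, the fixed order only makes sense together with skipdone; without it the first wiki in the list would be picked every time.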

-- 
To view, visit https://gerrit.wikimedia.org/r/398861
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I494ed57363b1ddfe0e10be0aed25facb7ca8a364
Gerrit-PatchSet: 2
Gerrit-Project: operations/dumps
Gerrit-Branch: master
Gerrit-Owner: ArielGlenn <[email protected]>
Gerrit-Reviewer: ArielGlenn <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
