ArielGlenn has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/348153 )

Change subject: last page range for page content job would sometimes have too many revs
......................................................................

last page range for page content job would sometimes have too many revs

Now we continue iterating by revcount amounts until we reach the last page
to be dumped.

Change-Id: Id8832d628a49026da9e7f4ea17548ed340e191cd
---
M xmldumps-backup/dumps/pagerange.py
1 file changed, 13 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/dumps refs/changes/53/348153/1

diff --git a/xmldumps-backup/dumps/pagerange.py b/xmldumps-backup/dumps/pagerange.py
index 56e6da6..b21b999 100644
--- a/xmldumps-backup/dumps/pagerange.py
+++ b/xmldumps-backup/dumps/pagerange.py
@@ -203,12 +203,16 @@
             estimate = self.qrunner.get_estimate(page_start, page_end)
             revs_for_range = self.get_revcount(int(page_start), int(page_end), estimate)
             numjobs = revs_for_range / numrevs + 1
-        for jobnum in range(1, numjobs + 1):
-            if jobnum == numjobs:
-                # last job, don't bother searching. just append up to max page id
-                ranges.append((str(page_start), str(page_end)))
-                break
+        jobnum = 1
+        while True:
+            jobnum += 1
             numjobs_left = numjobs - jobnum + 1
+            if numjobs_left <= 0:
+                # our initial count was a bit off, and we'll have more jobs
+                # than we thought. just keep passing the same endpoint
+                # and getting ranges until we've gotten up through
+                # the endpoint returned
+                numjobs_left = 1
             interval = (page_end - page_start) / numjobs_left + 1
             (start, end) = self.get_pagerange(page_start, numrevs,
                                               page_start + interval, prevguess)
@@ -240,10 +244,10 @@
         maxtodo = 50000
 
         runstodo = estimate / maxtodo + 1
-        # let's say minimum pages per job is 10, that's
+        # let's say minimum pages per job is 1, that's
         # quite reasonable (in the case where some pages
         # have many many revisions
-        step = ((page_end - page_start) / runstodo) + 10
+        step = ((page_end - page_start) / runstodo) + 1
         ends = range(page_start, page_end, step)
 
         if ends[-1] != page_end:
@@ -287,8 +291,8 @@
             if not interval:
                 return (page_start, badguess)
 
-            # set 10 pages as an absolute minimum in a query
-            if badguess - page_start <= 10:
+            # set 1 page as an absolute minimum in a query
+            if badguess - page_start <= 1:
                 return (page_start, badguess)
 
             prevguess = badguess
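The fix above replaces a fixed-count `for` loop with an open-ended loop, so that when the initial revision-count estimate is low, the code keeps producing page ranges until the final page id is actually covered. As a rough illustration of that idea (a standalone sketch, not the actual `pagerange.py` code; `rev_counts` is a hypothetical stand-in for the database revision estimates the real code queries):

```python
# Sketch of the fixed behavior: keep emitting page ranges until the last
# page id is covered, even if an up-front job-count estimate was too low.
# rev_counts maps page id -> number of revisions (hypothetical test data).

def split_ranges(page_start, page_end, numrevs, rev_counts):
    """Yield (start, end) page ranges of roughly numrevs revisions each."""
    ranges = []
    start = page_start
    while start <= page_end:          # loop until the endpoint is reached,
        revs = 0                      # not for a precomputed number of jobs
        end = start
        # grow the range until adding the next page would exceed numrevs
        while end <= page_end and revs + rev_counts.get(end, 0) <= numrevs:
            revs += rev_counts.get(end, 0)
            end += 1
        if end == start:
            # a single page alone exceeds numrevs; take it by itself
            # (mirrors the patch's "1 page absolute minimum" per query)
            end = start + 1
        ranges.append((start, end - 1))
        start = end
    return ranges
```

With `rev_counts = {1: 5, 2: 5, 3: 5, 4: 5}` and `numrevs = 10`, this yields `[(1, 2), (3, 4)]`: two jobs of two pages each, and the loop terminates only once page 4 is included.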

-- 
To view, visit https://gerrit.wikimedia.org/r/348153
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id8832d628a49026da9e7f4ea17548ed340e191cd
Gerrit-PatchSet: 1
Gerrit-Project: operations/dumps
Gerrit-Branch: ariel
Gerrit-Owner: ArielGlenn <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
