Dvorapa added a comment.
I don't understand what is wrong. For me, everything works with the pages you listed. Here is the whole code to test (you have to import _get_regexes too):
def load_page(self):
    """Load the page to be archived and break it up into threads."""
    self.header = ''
    self.threads = []
    self.archives = {}
    self.archived_threads = 0
    text = self.get()
    # Replace text in following exceptions by spaces, but don't change line
    # numbers
    exceptions = ['comment', 'code', 'pre', 'source', 'nowiki']
    exc_regexes = _get_regexes(exceptions, self.site)
    stripped_text = text
    for regex in exc_regexes:
        for match in re.finditer(regex, stripped_text):
            before = stripped_text[:match.start()]
            restricted = stripped_text[match.start():match.end()]
            after = stripped_text[match.end():]
            restricted = re.sub(r'[^\n]', r'', restricted)
            stripped_text = before + restricted + after
    # Find thread headers in stripped text and return their line numbers
    stripped_lines = stripped_text.split('\n')
    thread_headers = []
    for line_number, line in enumerate(stripped_lines, start=1):
        if re.search(r'^== *[^=].*? *== *$', line):
            thread_headers.append(line_number)
    # Fill self by original thread headers on returned line numbers
    lines = text.split('\n')
    found = False  # Reading header
    cur_thread = None
    for line_number, line in enumerate(lines, start=1):
        if line_number in thread_headers:
            thread_header = re.search('^== *([^=].*?) *== *$', line)
            found = True  # Reading threads now
            if cur_thread:
                self.threads.append(cur_thread)
            cur_thread = DiscussionThread(thread_header.group(1), self.now,
                                          self.timestripper)
        else:
            if found:
                cur_thread.feed_line(line)
            else:
                self.header += line + '\n'
    if cur_thread:
        self.threads.append(cur_thread)
    # This extra info is not desirable when run under the unittest
    # framework, which may be run either directly or via setup.py
    if pywikibot.calledModuleName() not in ['archivebot_tests', 'setup']:
        pywikibot.output(u'%d Threads found on %s'
                         % (len(self.threads), self))
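To make the header pattern concrete, here is a small standalone sketch (not part of archivebot.py; the helper name header_title and the sample lines are made up for illustration). It shows which lines the regular expression from the code above treats as thread headers: only level-2 headings that start at the beginning of the line.

```python
import re

# Same pattern as in load_page(): a level-2 heading filling the whole line.
HEADER = re.compile(r'^== *([^=].*?) *== *$')


def header_title(line):
    """Return the thread title if the line is a level-2 header, else None."""
    match = HEADER.search(line)
    return match.group(1) if match else None


# Hypothetical sample lines, chosen only to exercise the pattern.
candidates = [
    '== Thread title ==',             # plain level-2 header
    '==No spaces==',                  # spaces around the title are optional
    '== Trailing space == ',          # trailing spaces are tolerated
    '=== Subsection ===',             # level 3: not a thread header
    '= Top level =',                  # level 1: not a thread header
    'Text with == equals == inside',  # does not start with "=="
]

for line in candidates:
    print(repr(line), '->', header_title(line))
```

The first three lines yield a title; the last three yield None, so subsections stay inside the thread they belong to.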
Latest pywikibot commit I've tested this on (with no modifications):
24617f35d693b65e4d9bf14755dae8af835390ed

Command I've used for testing:
$ python pwb.py archivebot -lang:hu User:Cherybot/config -page:"..." -force -user:Dvorapa
Processing [[hu:...]]
xy Threads found on [[hu:...]]
Looking for: {{Szerkesztő:Cherybot/config}} in [[hu:...]]
Processing xy threads
Archiving yx thread(s).
Page [[...]] saved

If you use a customized version, please make sure you have ported all of the new functionality into your code (you can compare what was changed in https://gerrit.wikimedia.org/r/#/c/397803/11/scripts/archivebot.py):
New functionality you should port:

...
from pywikibot.textlib import _get_regexes
...
    ...
    text = self.get()
    # Replace text in following exceptions by spaces, but don't change line
    # numbers
    exceptions = ['comment', 'code', 'pre', 'source', 'nowiki']
    exc_regexes = _get_regexes(exceptions, self.site)
    stripped_text = text
    for regex in exc_regexes:
        for match in re.finditer(regex, stripped_text):
            before = stripped_text[:match.start()]
            restricted = stripped_text[match.start():match.end()]
            after = stripped_text[match.end():]
            restricted = re.sub(r'[^\n]', r'', restricted)
            stripped_text = before + restricted + after
    # Find thread headers in stripped text and return their line numbers
    stripped_lines = stripped_text.split('\n')
    thread_headers = []
    for line_number, line in enumerate(stripped_lines, start=1):
        if re.search(r'^== *[^=].*? *== *$', line):
            thread_headers.append(line_number)
    # Fill self by original thread headers on returned line numbers
    lines = text.split('\n')
    found = False  # Reading header
    cur_thread = None
    for line_number, line in enumerate(lines, start=1):
        if line_number in thread_headers:
            thread_header = re.search('^== *([^=].*?) *== *$', line)
            found = True  # Reading threads now
            if cur_thread:
                self.threads.append(cur_thread)
            cur_thread = DiscussionThread(thread_header.group(1), self.now,
                                          self.timestripper)
        else:
            if found:
                cur_thread.feed_line(line)
            else:
                self.header += line + '\n'
    ...

And if your customized version fixes T72249, please help pywikibot and submit a patch.
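If it helps when comparing against a customized copy, the idea behind the new code can be tried outside pywikibot. The following is only a rough sketch under stated assumptions: EXCEPTION_REGEXES is a simplified, hypothetical stand-in for what _get_regexes returns, and mask_exceptions and find_thread_headers are illustrative names, not pywikibot functions. One deliberate difference from the code above: this sketch replaces each masked character with a space rather than an empty string, so the string length (and not just the line count) stays the same while matches are processed.

```python
import re

# Hypothetical, simplified stand-ins for the patterns returned by
# pywikibot.textlib._get_regexes(); the real patterns cover more cases.
EXCEPTION_REGEXES = [
    re.compile(r'<!--.*?-->', re.DOTALL),
    re.compile(r'<nowiki>.*?</nowiki>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<pre>.*?</pre>', re.DOTALL | re.IGNORECASE),
]
HEADER = re.compile(r'^== *([^=].*?) *== *$')


def mask_exceptions(text):
    """Blank out exception regions, keeping newlines and string length."""
    for regex in EXCEPTION_REGEXES:
        # Every non-newline character in a match becomes one space, so
        # character offsets and line numbers are unchanged.
        text = regex.sub(lambda m: re.sub(r'[^\n]', ' ', m.group()), text)
    return text


def find_thread_headers(text):
    """Return (line_number, title) for level-2 headers outside exceptions."""
    masked_lines = mask_exceptions(text).split('\n')
    original_lines = text.split('\n')
    headers = []
    pairs = zip(masked_lines, original_lines)
    for number, (masked, original) in enumerate(pairs, start=1):
        # Detect the header on the masked line, read the title from the
        # original line with the same line number.
        if HEADER.search(masked):
            headers.append((number, HEADER.search(original).group(1)))
    return headers


sample = (
    'Intro\n'
    '<nowiki>\n'
    '== Not a thread ==\n'
    '</nowiki>\n'
    '== Real thread ==\n'
    'Some reply\n'
    '<!-- == Hidden == -->\n'
    '== Second thread ==\n'
)
print(find_thread_headers(sample))
# prints [(5, 'Real thread'), (8, 'Second thread')]
```

The header inside the nowiki block and the one inside the comment are skipped, while the line numbers of the real headers still refer to the original text.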
To: Dvorapa
Cc: Xqt, gerritbot, Ato_01, Tacsipacsi, revi, Dvorapa, Aklapper, jeblad, Ghouston, whym, pywikibot-bugs-list, Cpaulf30, Baloch007, Darkminds3113, Lordiis, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, Lewizho99, Maathavan
_______________________________________________
pywikibot-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot-bugs
