Dvorapa added a comment.

I don't understand what is wrong. For me everything works with the pages you listed. Here is the whole code to test (you have to import _get_regexes too):

def load_page(self):
    """Load the page to be archived and break it up into threads."""
    self.header = ''
    self.threads = []
    self.archives = {}
    self.archived_threads = 0
    text = self.get()
    # Replace text matched by the following exceptions with spaces,
    # but keep line numbers unchanged
    exceptions = ['comment', 'code', 'pre', 'source', 'nowiki']
    exc_regexes = _get_regexes(exceptions, self.site)
    stripped_text = text
    for regex in exc_regexes:
        for match in re.finditer(regex, stripped_text):
            before = stripped_text[:match.start()]
            restricted = stripped_text[match.start():match.end()]
            after = stripped_text[match.end():]
            # Use spaces, not deletion, so the string keeps its
            # length and later match offsets stay valid
            restricted = re.sub(r'[^\n]', r' ', restricted)
            stripped_text = before + restricted + after
    # Find thread headers in stripped text and return their line numbers
    stripped_lines = stripped_text.split('\n')
    thread_headers = []
    for line_number, line in enumerate(stripped_lines, start=1):
        if re.search(r'^== *[^=].*? *== *$', line):
            thread_headers.append(line_number)
    # Rebuild the threads from the original text, using the header
    # line numbers found above
    lines = text.split('\n')
    found = False  # Reading header
    cur_thread = None
    for line_number, line in enumerate(lines, start=1):
        if line_number in thread_headers:
            thread_header = re.search(r'^== *([^=].*?) *== *$', line)
            found = True  # Reading threads now
            if cur_thread:
                self.threads.append(cur_thread)
            cur_thread = DiscussionThread(thread_header.group(1), self.now,
                                          self.timestripper)
        elif found:
            cur_thread.feed_line(line)
        else:
            self.header += line + '\n'
    if cur_thread:
        self.threads.append(cur_thread)
    # This extra info is not desirable when run under the unittest
    # framework, which may be run either directly or via setup.py
    if pywikibot.calledModuleName() not in ['archivebot_tests', 'setup']:
        pywikibot.output(u'%d Threads found on %s'
                         % (len(self.threads), self))
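The key point above is that the matched exception regions are blanked out with line breaks preserved, rather than cut out of the text, so the stripped text and the original text have exactly the same line numbering. A minimal standalone sketch of that technique, using a plain `<nowiki>` regex instead of textlib's `_get_regexes` (the helper name `blank_matches` is mine, not pywikibot's):

```python
import re


def blank_matches(text, regex):
    """Replace every match of regex with spaces, preserving newlines.

    The result has the same length and line count as the input, so
    line numbers computed on it are valid for the original text.
    """
    stripped = text
    for match in re.finditer(regex, stripped):
        # Spaces instead of deletion: the string keeps its length,
        # so the offsets of later matches remain correct.
        blanked = re.sub(r'[^\n]', ' ', match.group(0))
        stripped = stripped[:match.start()] + blanked + stripped[match.end():]
    return stripped


text = 'intro\n<nowiki>== fake header ==\n</nowiki>\n== real header =='
stripped = blank_matches(text, re.compile(r'<nowiki>.*?</nowiki>', re.DOTALL))
assert len(stripped) == len(text)
assert '== real header ==' in stripped and 'fake' not in stripped
```

Running a header search on `stripped` therefore skips the fake header inside `<nowiki>` while reporting line numbers that are valid for `text`.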

Latest pywikibot commit I've tested this on (with no modifications):
24617f35d693b65e4d9bf14755dae8af835390ed

Command I've used for testing:

$ python pwb.py archivebot -lang:hu User:Cherybot/config -page:"..." -force -user:Dvorapa
Processing [[hu:...]]
xy Threads found on [[hu:...]]
Looking for: {{Szerkesztő:Cherybot/config}} in [[hu:...]]
Processing xy threads
Archiving yx thread(s).
Page [[...]] saved

If you use a customized version, please make sure you have ported all of the new functionality into your code (you can compare what changed in https://gerrit.wikimedia.org/r/#/c/397803/11/scripts/archivebot.py):

New functionality you should port:
...
from pywikibot.textlib import _get_regexes
...

...
        text = self.get()
        # Replace text in following exceptions by spaces, but don't change line
        # numbers
        exceptions = ['comment', 'code', 'pre', 'source', 'nowiki']
        exc_regexes = _get_regexes(exceptions, self.site)
        stripped_text = text
        for regex in exc_regexes:
            for match in re.finditer(regex, stripped_text):
                before = stripped_text[:match.start()]
                restricted = stripped_text[match.start():match.end()]
                after = stripped_text[match.end():]
                restricted = re.sub(r'[^\n]', r'', restricted)
                stripped_text = before + restricted + after
        # Find thread headers in stripped text and return their line numbers
        stripped_lines = stripped_text.split('\n')
        thread_headers = []
        for line_number, line in enumerate(stripped_lines, start=1):
            if re.search(r'^== *[^=].*? *== *$', line):
                thread_headers.append(line_number)
        # Rebuild the threads from the original text, using the header
        # line numbers found above
        lines = text.split('\n')
        found = False  # Reading header
        cur_thread = None
        for line_number, line in enumerate(lines, start=1):
            if line_number in thread_headers:
                thread_header = re.search(r'^== *([^=].*?) *== *$', line)
                found = True  # Reading threads now
                if cur_thread:
                    self.threads.append(cur_thread)
                cur_thread = DiscussionThread(thread_header.group(1), self.now,
                                              self.timestripper)
            elif found:
                cur_thread.feed_line(line)
            else:
                self.header += line + '\n'
...
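For reference, the header regex in this code only recognizes level-2 headings that start at the first column of a line; deeper or indented headings are passed through as thread content. A quick check of which lines it accepts (the sample lines are my own):

```python
import re

# Same pattern as in load_page: a level-2 wikitext heading on its own line
HEADER = re.compile(r'^== *([^=].*?) *== *$')

assert HEADER.search('== Thread title ==').group(1) == 'Thread title'
assert HEADER.search('==No spaces==').group(1) == 'No spaces'
assert HEADER.search('=== Level 3 ===') is None   # third '=' fails [^=]
assert HEADER.search(' == indented ==') is None   # must start at column 0
```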

And if your customized version fixes T72249, please help pywikibot and submit a patch.


TASK DETAIL
https://phabricator.wikimedia.org/T182496
