Re: [Pywikipedia-l] Feature request for pagegenerators.RegexFilterPageGenerator

Dr. Trigon Fri, 08 Oct 2010 01:26:43 -0700

I have to raise this again. I was testing the new revision and have
some issues. I need the possibility to compare against page.title()
instead of page.titleWithoutNamespace(). Additionally, it would be great
if the regex could also be passed in compiled form (but this is
optional).
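
To make both requests concrete, here is a minimal sketch in Python 3. It is
not the attached patch: the helper name compile_patterns is made up for
illustration, and titles are plain strings so that no pywikipedia objects
are needed. It accepts a single string, a single compiled pattern, or a
list mixing both, and shows why the namespace prefix matters once the full
title is compared.

```python
import re


def compile_patterns(regex, flags=re.I):
    """Return a list of compiled patterns.

    Hypothetical helper: accepts one string, one compiled pattern,
    or an iterable mixing both.
    """
    if isinstance(regex, (str, re.Pattern)):
        regex = [regex]
    # Compile strings; pass already-compiled patterns through unchanged.
    return [re.compile(r, flags) if isinstance(r, str) else r for r in regex]


patterns = compile_patterns(["user:", re.compile(r"Talk:")])

# re.match anchors at the start of the string, so the namespace prefix
# only takes part in the match when the full title is compared.
full_title = "User:Example/Sandbox"
bare_title = "Example/Sandbox"
matches_full = any(p.match(full_title) for p in patterns)
matches_bare = any(p.match(bare_title) for p in patterns)
```

Checking each list item separately, rather than only the first one, also
covers an empty list and a list that mixes strings with compiled patterns.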


Please look at the attached patch...

Thanks a lot and greetings
DrTrigon


On 05.10.2010 18:24, [email protected] wrote:

Done in r8609. Thanks


----- Original Message ----
From:    "Dr. Trigon"<[email protected]>
To:      Pywikipedia discussion list<[email protected]>
Date:    19.09.2010 00:06
Subject: [Pywikipedia-l] Feature request for
        pagegenerators.RegexFilterPageGenerator

Hello all

I'd like to suggest the code change given by the attached patch.

The idea is to change RegexFilterPageGenerator slightly. First, change
the 'regex' parameter to accept a list of regexes instead of a single
one; the whole list is checked for a positive match. Second, add a new
parameter 'invert' which, if set to True, changes the generator from
yielding pages on ANY positive match to yielding pages only when NO
regex matches at all. This way both positive (additive) and negative
(subtractive) filter behaviour can be achieved.
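
The intended behaviour can be sketched roughly as follows (Python 3). This
is only an illustration and not the attached patch: regex_filter is a
made-up stand-in that works on plain title strings instead of Page objects,
and the flag is called inverse here, as in the signature found in
pagegenerators.py.

```python
import re


def regex_filter(titles, regexes, inverse=False):
    """Yield titles matched by any regex, or by none if inverse is True.

    Illustrative stand-in that works on plain strings, not Page objects.
    """
    compiled = [re.compile(r, re.I) for r in regexes]
    for title in titles:
        matched = any(r.match(title) for r in compiled)
        # inverse=False keeps matches; inverse=True keeps non-matches.
        if matched != inverse:
            yield title


titles = ["Sandbox", "Main Page", "Sandbox/Test", "Help"]
kept = list(regex_filter(titles, ["sandbox", "help"]))
dropped = list(regex_filter(titles, ["sandbox", "help"], inverse=True))
```

With these sample titles, kept holds the three titles starting with
"Sandbox" or "Help", and dropped holds only "Main Page".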

This would also be very helpful for my bot... ;)

Thanks a lot and greetings
DrTrigon


--------------------------------

_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l

Index: pagegenerators.py
===================================================================
--- pagegenerators.py   (Revision 8620)
+++ pagegenerators.py   (working copy)
@@ -1160,25 +1160,36 @@
             seenPages[_page] = True
             yield page
 
-def RegexFilterPageGenerator(generator, regex, inverse=False):
+def RegexFilterPageGenerator(generator, regex, inverse=False, ignore_namespace=True):
     """
     Wraps around another generator. Yields only those pages, the titles of
     which are positively matched to any regex in list. If invert is False,
     yields all pages matched by any regex, if True, yields all pages matched
-    none of the regex.
+    none of the regex. If ignore_namespace is False, the whole page title
+    is compared.
 
     """
     # test for backwards compatibility
     if isinstance(regex, basestring):
         regex = [regex]
-    reg = [ re.compile(r, re.I) for r in regex ]
+    # test if regex is already compiled
+    if isinstance(regex[0], basestring):
+        reg = [ re.compile(r, re.I) for r in regex ]
+    else:
+        reg = regex
 
     for page in generator:
+        # get the page title
+        if ignore_namespace:
+            title = page.titleWithoutNamespace()
+        else:
+            title = page.title()
+
         if inverse:
             # yield page if NOT matched by all regex
             skip = False
             for r in reg:
-                if r.match(page.titleWithoutNamespace()):
+                if r.match(title):
                     skip = True
                     break
             if not skip:
@@ -1186,7 +1197,7 @@
         else:
             # yield page if matched by any regex
             for r in reg:
-                if r.match(page.titleWithoutNamespace()):
+                if r.match(title):
                     yield page
                     break
