I have to raise this again. I was testing the new revision and ran into
some issues. I need to be able to compare against page.title() instead
of page.titleWithoutNamespace(). Additionally, it would be great if the
regex could also be passed in compiled form (but this is
optional).
Please have a look at the attached patch...
Thanks a lot and greetings
DrTrigon
On 05.10.2010 18:24, [email protected] wrote:
Done in r8609. Thanks
----- Original Message ----
From: "Dr. Trigon"<[email protected]>
To: Pywikipedia discussion list<[email protected]>
Date: 19.09.2010 00:06
Subject: [Pywikipedia-l] Feature request for
pagegenerators.RegexFilterPageGenerator
Hello all
I'd like to suggest the code change in the attached patch.
The idea is to change RegexFilterPageGenerator slightly. The first
change turns the 'regex' parameter into a list of regexes instead of a
single one; the whole list is checked for a positive match. The
second change adds a new parameter 'invert' which, if set to
True, switches the generator from yielding pages on ANY POSITIVE match
to yielding pages on NO POSITIVE match AT ALL. This way both positive
(additive) and negative (subtractive) filter behaviour can be achieved.
This would also be very helpful for my bot... ;)
Thanks a lot and greetings
DrTrigon
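
The two filter modes described above can be illustrated without pywikipedia. This is only a rough sketch under assumptions: filter_titles is a hypothetical helper (not part of pagegenerators.py), it works on plain title strings instead of Page objects, and it is written for Python 3.

```python
import re

def filter_titles(titles, regexes, invert=False):
    """Yield titles matched by any regex, or by none of them if invert is True."""
    # Accept a single pattern string for backwards compatibility.
    if isinstance(regexes, str):
        regexes = [regexes]
    compiled = [re.compile(r, re.I) for r in regexes]
    for title in titles:
        matched = any(r.match(title) for r in compiled)
        # invert=False keeps matching titles; invert=True keeps the rest.
        if matched != invert:
            yield title

titles = ["Apple", "Banana", "Avocado", "Cherry"]
additive = list(filter_titles(titles, ["a", "b"]))
subtractive = list(filter_titles(titles, ["a", "b"], invert=True))
```

Here additive holds the titles starting with "a" or "b" (case-insensitively), and subtractive holds the remaining ones.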
--------------------------------
_______________________________________________
Pywikipedia-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
Index: pagegenerators.py
===================================================================
--- pagegenerators.py (revision 8620)
+++ pagegenerators.py (working copy)
@@ -1160,25 +1160,36 @@
             seenPages[_page] = True
             yield page
-def RegexFilterPageGenerator(generator, regex, inverse=False):
+def RegexFilterPageGenerator(generator, regex, inverse=False, ignore_namespace=True):
     """
     Wraps around another generator. Yields only those pages, the titles of
     which are positively matched to any regex in list. If invert is False,
     yields all pages matched by any regex, if True, yields all pages matched
-    none of the regex.
+    none of the regex. If ignore_namespace is False, the whole page title
+    is compared.
     """
     # test for backwards compatibility
     if isinstance(regex, basestring):
         regex = [regex]
-    reg = [ re.compile(r, re.I) for r in regex ]
+    # test if regex is already compiled
+    if isinstance(regex[0], basestring):
+        reg = [ re.compile(r, re.I) for r in regex ]
+    else:
+        reg = regex
     for page in generator:
+        # get the page title
+        if ignore_namespace:
+            title = page.titleWithoutNamespace()
+        else:
+            title = page.title()
+
         if inverse:
             # yield page if NOT matched by all regex
             skip = False
             for r in reg:
-                if r.match(page.titleWithoutNamespace()):
+                if r.match(title):
                     skip = True
                     break
             if not skip:
@@ -1186,7 +1197,7 @@
         else:
             # yield page if matched by any regex
             for r in reg:
-                if r.match(page.titleWithoutNamespace()):
+                if r.match(title):
                     yield page
                     break
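
For readers who want to try the patched behaviour outside the framework, here is a minimal standalone sketch. It is not the committed code: it assumes Python 3.8+ (so str and re.Pattern replace basestring), FakePage is a hypothetical stand-in that models only the two title methods used in the patch, and regex_filter is an illustrative name. Unlike the patch, it decides per element whether to compile, so an empty list or a single precompiled pattern does not raise.

```python
import re

class FakePage:
    """Hypothetical stand-in for a pywikipedia Page (assumption, not the real class)."""
    def __init__(self, namespace, name):
        self.namespace = namespace
        self.name = name

    def title(self):
        # Full title including the namespace prefix, if any.
        return f"{self.namespace}:{self.name}" if self.namespace else self.name

    def titleWithoutNamespace(self):
        return self.name

def regex_filter(generator, regex, inverse=False, ignore_namespace=True):
    """Yield pages whose title matches any regex, or none of them if inverse is True."""
    # Accept a single pattern (string or compiled) as well as a list.
    if isinstance(regex, (str, re.Pattern)):
        regex = [regex]
    # Compile strings case-insensitively; pass precompiled patterns through unchanged.
    reg = [re.compile(r, re.I) if isinstance(r, str) else r for r in regex]
    for page in generator:
        title = page.titleWithoutNamespace() if ignore_namespace else page.title()
        matched = any(r.match(title) for r in reg)
        if matched != inverse:
            yield page

pages = [FakePage("User", "DrTrigon"), FakePage("", "User guide"), FakePage("Talk", "Sandbox")]
by_name = [p.title() for p in regex_filter(pages, "user")]
by_full = [p.title() for p in regex_filter(pages, re.compile(r"User:"), ignore_namespace=False)]
not_talk = [p.title() for p in regex_filter(pages, [r"Talk:"], inverse=True, ignore_namespace=False)]
```

With ignore_namespace left at True only the bare page name is tested, so "user" finds "User guide" but not "User:DrTrigon"; with ignore_namespace=False the namespace prefix becomes part of the matched string.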