On Sat, Dec 24, 2011 at 20:41, Dr. Trigon <[email protected]> wrote:
> I do know that heading or section recognition inside the framework was
> mostly (e.g. archive bot) done by using regex... I myself felt always
> that it is not reliable since there are a lot of odd possible
> situations.

That's true - the regex solution that I gave works sometimes, but sometimes it still matches inside headers. Don't know why - haven't debugged it yet.

> Thus I wrote an 'getSections' method for DrTrigonBot but
> I am not aware if this could be of any use for you...
>
> Anyway feel free to have a look at it and use it if you like... ;)
>
> https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_wikipedia.py?hb=true#to122

Hmm... it's above me, as I don't speak Python. Not sure how to use it. :-(

Thanks anyway!
Chris

> Greetings
>
> On 22.12.2011 09:18, Chris Watkins wrote:
> > I just worked it out, mostly... instead of:
> > -exceptinsidetag:header
> >
> > I used:
> > -exceptinside:'=[^\n\r]*=[ \t]*'
> >
> > And it worked!
> >
> > There might be a small risk of false positives, so I tried various
> > tweaks, e.g.
> > -exceptinside:'^=[^\n\r]*=[ \t]*$'
> > -exceptinside:'[\n\r]=[^\n\r]*=[ \t]*[\n\r]'
> > -exceptinside:'[\n\r]=[^\n\r]*='
> >
> > But none worked... any suggestions?
> >
> > On Thu, Dec 22, 2011 at 18:21, Chris Watkins
> > <[email protected]> wrote:
> >
> > I have been using " -exceptinsidetag:header" with replace.py. This
> > was added by Daniel Herding in response to a request by me:
> >
> > On Mon, Jun 30, 2008 at 23:11, Daniel Herding <[email protected]>
> > wrote:
> >
> > This will exclude wikilinks and URLs. There are some more things
> > that can be excluded, see the source code of the method
> > replaceExcept() in wikipedia.py (look at the exceptionRegexes
> > dictionary). I have just added a regular expression for section
> > headers for you, so if you're running the SVN version, you can use
> > this parameter:
> >
> > -exceptinsidetag:header
> >
> > I seem to recall this working in a nightly version a couple of
> > years ago, but it's not working now - I'm not sure when it stopped.
> > Is it possible to put it back in?
> >
> > Thanks!

--
Chris Watkins
Appropedia.org - Sharing knowledge to build rich, sustainable lives.
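
A possible explanation for the behaviour of the patterns quoted above, as a small standalone Python sketch. It does not use pywikipedia at all, and it rests on an assumption that has not been checked against replace.py: that the -exceptinside pattern is compiled without re.MULTILINE. In Python's re module, '^' and '$' then match only at the very start and end of the whole page text, which would explain why the anchored variant never matched a header in the middle of a page. The unanchored pattern, on the other hand, also matches ordinary lines that happen to contain two '=' signs. The names HEADER and replace_outside_headers are made up for illustration.

```python
import re

# The unanchored pattern from the thread: matches any run between two '='
# on one line, so it also fires on ordinary prose like "A = B and C = D".
LOOSE = re.compile(r'=[^\n\r]*=[ \t]*')

# The anchored variant from the thread, compiled WITHOUT re.MULTILINE
# (assumption: this is how the option's pattern ends up being compiled).
ANCHORED_NO_FLAG = re.compile(r'^=[^\n\r]*=[ \t]*$')

# Hypothetical stricter pattern: a whole line that starts and ends with '='.
# re.MULTILINE lets '^' and '$' match at every line boundary; the optional
# '\r' tolerates Windows line endings.
HEADER = re.compile(r'^=+[^\n\r]*=+[ \t]*\r?$', re.MULTILINE)


def replace_outside_headers(text, old, new):
    """Replace `old` with `new` everywhere except on header lines."""
    pieces = []
    pos = 0
    for match in HEADER.finditer(text):
        # Text before this header: apply the replacement.
        pieces.append(text[pos:match.start()].replace(old, new))
        # The header itself: copy through unchanged.
        pieces.append(match.group(0))
        pos = match.end()
    # Remaining text after the last header.
    pieces.append(text[pos:].replace(old, new))
    return ''.join(pieces)


page = (
    "== Solar power ==\n"
    "Solar power is cheap.\n"
    "=== Solar cells ===\n"
    "More on Solar.\n"
)

if __name__ == "__main__":
    print(replace_outside_headers(page, "Solar", "Wind"))
```

If the assumption holds, one thing to try on the command line would be an inline flag at the start of the pattern, e.g. '(?m)^=+[^\n\r]*=+[ \t]*$', since Python accepts (?m) as a way to turn on multiline matching from inside the pattern itself. Whether replace.py passes the pattern through unchanged is something to test rather than take from this note.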
_______________________________________________ Pywikipedia-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
