Re: RegExp problems

jorrit787 Wed, 09 Mar 2016 17:30:20 -0800

Your improvements work great, thank you. And thank you for the very 
detailed explanations!


On Tuesday, March 8, 2016 at 9:41:11 AM UTC+1, Michal Petrucha wrote:
>
> On Mon, Mar 07, 2016 at 05:44:08PM -0800, [email protected] wrote: 
> > I'm trying to replace *[URL]www.link.com[/URL]* with HTML with this regexp: 
> > 
> > topic.text = re.sub("(\[URL\])(.*)(\[\/URL\])", '<a href="$2">$2</a>', topic.text, flags=re.I) 
> > 
> > But it's giving me the following problems: 
> > 
> >    1. The $2 capture group is only able to be repeated once, so I get 
> >    <a href="www.link.com">$2</a> 
> >    instead of 
> >    <a href="www.link.com">www.link.com</a> 
>
> I have my doubts – if you use the standard Python re library, then the 
> way to refer to captured groups is "\1", "\2", etc., not "$1". When I 
> try the code you posted above, I get the following result (i.e., not 
> even the first occurrence of "$2" gets substituted):: 
>
>     >>> re.sub("(\[URL\])(.*)(\[\/URL\])", '<a href="$2">$2</a>', '[URL]www.link.com[/URL]', flags=re.I) 
>     '<a href="$2">$2</a>' 
>
> In order to make the substitution work for a single occurrence of 
> [URL]...[/URL], you can use the following, which uses "\2". Also, when 
> writing regular expressions, or other strings that are supposed to 
> contain the backslash character, it is a good idea to write them as 
> raw string literals, i.e. prefix them with an "r", which I've done 
> below. That way, Python won't try to interpret the backslashes as 
> escape sequences; otherwise, "\2" would become the character with an 
> ASCII value of 2:: 
>
>     >>> re.sub(r"(\[URL\])(.*)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link.com[/URL]', flags=re.I) 
>     '<a href="www.link.com">www.link.com</a>' 
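
The difference between the raw and the plain replacement string can be checked directly. The following is only a small sketch using the pattern from this thread; the variable names are made up for illustration:

```python
import re

# A plain string literal turns "\2" into one control character (code point 2),
# while the raw literal keeps the backslash and the digit as two characters.
plain = "\2"
raw = r"\2"

# Same pattern as in the thread; group 2 is the text between the tags.
pattern = r"(\[URL\])(.*)(\[\/URL\])"
text = "[URL]www.link.com[/URL]"

# Raw replacement: re.sub sees the backreference \2 and fills in group 2.
with_raw = re.sub(pattern, r'<a href="\2">\2</a>', text, flags=re.I)

# Plain replacement: re.sub only sees the control character, so nothing
# from the match is inserted.
with_plain = re.sub(pattern, '<a href="\2">\2</a>', text, flags=re.I)

print(with_raw)
print(repr(with_plain))
```

Printing `repr(with_plain)` makes the stray control characters visible, which is usually the quickest way to spot a missing "r" prefix.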
>
> >    2. Only the first *[URL]* is matched. Everything after the first *[/URL]* 
> >    is simply deleted... 
>
> The solution above gets you halfway there: re.sub replaces all 
> matches by default. The problem here is that the "(.*)" part of your 
> regex matches everything between the first "[URL]" and the last 
> "[/URL]":: 
>
>     >>> re.sub(r"(\[URL\])(.*)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com[/URL]', flags=re.I) 
>     '<a href="www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com">www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com</a>' 
>
> The reason is that the asterisk operator in a regex is greedy, which 
> means a ".*" will try to match as much as possible. When you use the 
> non-greedy version of the operator (which you get by putting a 
> question mark after the asterisk), you get the result you want:: 
>
>     >>> re.sub(r"(\[URL\])(.*?)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com[/URL]', flags=re.I) 
>     '<a href="www.link1.com">www.link1.com</a><a href="www.link2.com">www.link2.com</a><a href="www.link3.com">www.link3.com</a>' 
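
To see what each quantifier actually captures, re.findall can be run with both variants. This is a rough sketch rather than code from the thread: the helper name url_tags_to_links and the compiled URL_TAG pattern are hypothetical, and the pattern is reduced to a single group, so the backreference becomes "\1" instead of "\2":

```python
import re

text = "[URL]www.link1.com[/URL][URL]www.link2.com[/URL]"

# Greedy: ".*" runs to the last "[/URL]", so there is a single oversized match.
greedy = re.findall(r"\[URL\](.*)\[/URL\]", text, flags=re.I)

# Non-greedy: ".*?" stops at the nearest "[/URL]", giving one match per tag pair.
lazy = re.findall(r"\[URL\](.*?)\[/URL\]", text, flags=re.I)

# Hypothetical helper: compile the non-greedy pattern once and reuse it.
# Only the link text is captured here, so it is group 1.
URL_TAG = re.compile(r"\[URL\](.*?)\[/URL\]", flags=re.I)


def url_tags_to_links(value):
    """Replace every [URL]...[/URL] pair in value with an HTML anchor."""
    return URL_TAG.sub(r'<a href="\1">\1</a>', value)


converted = url_tags_to_links(text)

print(greedy)
print(lazy)
print(converted)
```

Note that "." does not match a newline by default, so a link split across two lines would be left untouched unless re.S (re.DOTALL) is added to the flags.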
>
>
> You can read an explanation of the difference between greedy and 
> non-greedy regular expressions in the Python docs: 
> https://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy 
>
> Good luck, 
>
> Michal 
>
> >     
> > I hope someone can help me with this. I'm using Python 2.7 if it makes a 
> > difference. 
> > 
> > -- 
> > You received this message because you are subscribed to the Google Groups "Django users" group. 
> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. 
> > To post to this group, send email to [email protected]. 
> > Visit this group at https://groups.google.com/group/django-users. 
> > To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/fce5a726-8a4c-455a-a978-6ee70d66464e%40googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout. 
>
>

