On Fri, Nov 16, 2012 at 12:28 AM, <krishna.k.kish...@gmail.com> wrote:
> Can someone explain the below behavior please?
>
>>>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
>>>> re.findall(re_obj,'1000,1020,1000')
> ['1000']
>>>> re.findall(re_obj,'1000,1020, 1000')
> ['1020', '1000']

Try removing the grouping parentheses to see the full strings being
matched:

>>> re1 = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]?[ ]*?){1,3}')
>>> re.findall(re1,'1000,1020,1000')
['1000,1020,1000']
>>> re.findall(re1,'1000,1020, 1000')
['1000,1020,', '1000']

In the first case, the regular expression is matching the full string.
It could also match shorter substrings, but as only the space
quantifiers are non-greedy and there are no spaces to match anyway, it
does not. With the grouping parentheses in place, only the *last* value
of the group is returned, which is why you only see the last '1000'
instead of all three strings in the group, even though the group is
actually matching three different substrings.

In the second case, the regular expression finds first the '1000,1020,'
and then the '1000' as two separate matches. The reason for this is the
space. Since the quantifier on the space is non-greedy, it first tries
*not* matching the space. A third repetition cannot start at the space,
but stopping after two repetitions is already a valid match, and so it
does not backtrack. The '1000' is then identified as a separate match.
As before, with the grouping parentheses in place you see only the
'1020' and the last '1000' because the group only reports the last
substring it matched for that particular match.

> However when I use "[\,]??" instead of "[\,]?" as below, I see a different
> result
>>>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
>>>> re.findall(re_obj,'1000,1020,1000')
> ['1000', '1020', '1000']
>
> I am not able to understand what's causing the difference of behavior here, I
> am assuming it's not 'greediness' if "?"

The difference is the non-greediness of the comma quantifier. When it
comes time for it to match the comma, because the quantifier is
non-greedy, it first tries *not* matching the comma, whereas before it
first tried to match it. As with the space above, when the comma is not
matched, it finds that it has a valid match anyway if it just stops
matching immediately.
So it does not need to backtrack, and in this case it ends each match
just before the comma and returns all three numbers as separate
matches.

What exactly is it that you're trying to do with this regular
expression? I suspect that the solution is actually a lot simpler than
you're making it.

--
http://mail.python.org/mailman/listinfo/python-list
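
The behavior described above can be checked with a small script. This is
only a sketch: the names GREEDY_COMMA, LAZY_COMMA and describe() are
made up for illustration and do not appear in the thread. It uses
re.finditer so that each whole match can be shown next to the last value
captured by group 1.

```python
import re

# The two patterns from the thread; they differ only in the comma
# quantifier ("[\,]?" is greedy, "[\,]??" is non-greedy).
GREEDY_COMMA = r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}'
LAZY_COMMA = r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}'


def describe(pattern, text):
    """Hypothetical helper: (whole match, last value of group 1) per match."""
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]


# One match covering the whole string; the group keeps only its last value.
print(describe(GREEDY_COMMA, '1000,1020,1000'))
# [('1000,1020,1000', '1000')]

# The lazy space quantifier lets the first match stop before the space.
print(describe(GREEDY_COMMA, '1000,1020, 1000'))
# [('1000,1020,', '1020'), ('1000', '1000')]

# The lazy comma quantifier lets every match stop before the comma.
print(describe(LAZY_COMMA, '1000,1020,1000'))
# [('1000', '1000'), ('1020', '1020'), ('1000', '1000')]
```

When a pattern contains exactly one capturing group, re.findall returns
only that group's value for each match, which is why the original output
shows just the second element of each pair above.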