On 18Apr2012 23:11, Sania <fantasyblu...@gmail.com> wrote:
| So I am trying to get the number of casualties in a text. After 'death
| toll' in the text the number I need is presented as you can see from
| the variable called text. Here is my code
| I'm pretty sure my regex is correct, I think it's the group part
| that's the problem.
| I am using nltk by python. Group grabs the string in parenthesis and
| stores it in deadnum and I make deadnum into a list.
|
| text="accounts put the death toll at 637 and those missing at
| 653 , but the total number is likely to be much bigger"

I presume you want the 637 and not the 653.

| dead=re.match(r".*death toll.*(\d[,\d\.]*)", text)

I always feel a little uncomfortable about double quotes and backslashes
(for all that the above is a "raw" string). Too much shell and C
programming perhaps. Anyway...

I would break this up like this:

    import re
    import sys

    re_DEATH_TOLL = r".*death toll.*(\d[,\d\.]*)"
    print("re_DEATH_TOLL =", re_DEATH_TOLL, file=sys.stderr)
    dead = re.match(re_DEATH_TOLL, text)

so I can print the raw text of the regexp _after_ python has parsed the
string.

Secondly, your regexp will match the wrong number, based on my
presumption above. Regexp quantifiers are greedy, so your second ".*"
will match as much as possible while still letting the rest of the
regexp match. It will therefore swallow everything up to the end of the
653, and the group will grab a digit from the wrong number.

Try (raw regexp):

    death toll\D*(\d+)

or

    death toll\D*(\d[\d,.]*)

and also use re.search instead of re.match; re.search will find the
first match anywhere in the string, whereas re.match only matches at the
start of the string. That avoids complicating the regexp with a leading
".*".

\D is a non-digit. "+" means one or more, like "*" means zero or more.

Cheers,
-- 
Cameron Simpson <c...@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

I'm not weird; I'm gifted.
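
For anyone wanting to check the greedy-versus-anchored behaviour
described above, here is a minimal sketch. It assumes Python 3 and that
the sample text sits on a single line (the quoted post wraps it); the
variable names greedy and anchored are just for illustration.

```python
import re

# Sample text from the original post, joined onto one line (an assumption).
text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Original pattern: the greedy ".*" after "death toll" runs to the end of
# the string, then backtracks only as far as the last digit it can find.
greedy = re.match(r".*death toll.*(\d[,\d\.]*)", text)

# Suggested pattern: \D* can only skip non-digits, so the group has to
# start at the first number after "death toll".
anchored = re.search(r"death toll\D*(\d[\d,.]*)", text)

print(greedy.group(1))    # 3   (the last digit of 653)
print(anchored.group(1))  # 637
```

If the number may contain thousands separators, stripping the commas
before calling int() on the captured group is one way to get a usable
integer.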